Hi everyone! For my first Medium post I wanted to do something fun and interesting, and to provide a real-world example of the power of
Background: The on-going COVID-19 pandemic has taken more than 4 million human lives worldwide since it started on 2020. Novel vaccines were developed at the end of 2020 to reduce the mortality rate associated with this respiratory infectious disease. There has been a different level of vaccine adoption among countries, so the key question this report seeks to answer is whether countries with higher vaccination rates are associated with lower COVID-19 case fatality rate.
Method: Using country and monthly case, death and vaccination data related to COVID-19, a generalized linear model with a negative binomial distribution was fitted, to estimate the effect of vaccination at country level in reducing the case fatality rate (CFR) of COVID-19. The model was controlled by each country heterogeneity and the time development of the COVID-19 disease.
Results: A statistical significant relationship was found, higher vaccination rates in a country reduces the relative risk of COVID-19. In comparison to the group with 0%-25% vaccinated people, countries in the 25%-50% group have a 19% [C.I. 95%: -27%;-11%] lower CFR for COVID-19, countries in the 50%-75% group have a 27% lower CFR [-35%;-18%] and countries in the 75%-100% group have a 37% lower CFR [-48%;-24%]. Additionally, a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%;-4.61%]. Potential limitations of the analysis are due to the potential presence of time and spatial autocorrelation in the data.
Conclusion: A negative association between higher vaccination rates and lower CFR for COVID-19 was found. Countries with current low vaccination rates should prioritize a policy to vaccinate their population with the goal to reduce the death toll of the COVID-19 pandemic.
The on-going COVID-19 pandemic has taken more than 4 million human lives worldwide since it started on 2020. Novel vaccines were developed at the end of 2020 to reduce the mortality rate associated with this respiratory infectious disease. There has been a different level of vaccine adoption among countries, so the key question to answer is whether country with higher vaccination rates among their population are associated with lower COVID-19 case fatality rates (CFR).
MENCIONAR ESTUDIO RCT EXISTENTES QE MUESTRAN LA EFICACIA DE VACUNAS COVID
As an international organization with the mission to improve human health, the World Health Organization (WHO) has been compiling daily country level data for the COVID-19 pandemic. This data is easily accessible to the general public (link to download). The data contains COVID-19 daily related cases and deaths for a variety of countries. Here are the resolutions:
MEJORAR TEXTO: mas explicito
We are interested in exploring the effect of vaccines on the case fatality rate for COVID-19. Our World in Data is an organization that compiles country level data from multiple sources to make research more easy. They have a made a strong effort to compile a database with the total number of people fully vaccinated (people who received all doses prescribed by the vaccination protocol) by date and country. The vaccination data is available here, with a public GitHub.
For this project the variable to evaluate vaccination status was the total people fully vaccinated per 100 people in the country, or the percentage of people fully vaccinated. This metric was chosen for several reasons:
The virus causes an inflammatory respiratory stress on the human body. In the two years of the pandemic, approximately 5 million people have died, and more than 400 million have been infected. Several countries have taken different methods to mitigate the dispersion and impact of the pandemic, such as mobility restrictions (quarantines), social distancing measures, mask enforcement. A little more than a year ago (end of 2020) novel vaccines against COVID-19 were developed, which have proven useful to reduce the death rate due to COVID-19.
Several studies have found a negative association between vaccination and case fatality rate of COVID-19 (Liang et al., 2021; Passarelli-Araujo et al., 2022). In particular, Liang et al. (2021) found that a 10% increase in the vaccination among the population decrease the CFR of COVI-19 by 7.6%. This study uses a panel data consisting of 90 countries over 25 weeks, fitting a country-level random effects model (Liang et al., 2021). The variable for vaccination level was the number of people per 100 habitants that have received at least one dose of the vaccine. The question remains open to study the effect of people fully vaccinated (scheme completed), which this report seeks to answer.
Both datasets (WHO and OurWorldinData) were merged based in the country name, resulting in a match of 167 countries in total. For these countries we got reliable information on COVID-19 cases and deaths, and vaccination status for a given number of months, that may differ according to the start date of vaccination campaigns in each country.
As the COVID-19 disease usually takes two weeks to fully develop into a mortal disease, a two week shift in the death data (bring backward) was considered to align cases with deaths in the estimation of the monthly CFR for each country.
To improve the estimation and robustness of the analysis, the CFR was estimated for each country only in the months in which there were at least 100 cases and more than 10 COVID-19 deaths.
PQ MES: AGREGAR DATOS DE MUERTES/COVID Y ESTADO VACUNACION
MENCIONAR QUE SURVIVAL ANALYSIS NO APLICA
The time series data for COVID-19 cases and deaths, and vaccination rates per country was converted to a panel data: that is a cross-sectional ecological study (main observational unit is the monthly variables per country). CFR and vaccination rates were summarized to monthly averages to balance the trade-off between number of observations and accuracy of predictions. A monthly average allow us to have approximately 10 observations per country, and by averaging for the whole month we are reducing the random daily noise.
The model to be analyzed will be a generalized linear model, using panel data (ecological cross-sectional with a time series):
\[ log(Y_{i,t})=\beta_0+\beta_1 (Vaccination \: status)_{i,t}+\beta_{2,i} (Country)_i+\beta_{3,t} (Month)_t+\epsilon_{i,t} \]
Where:
The outcome variable in the model corresponds to number of deaths normalized by total cases of COVID-19 (the case fatality rate). The number of deaths were transformed using a negative binomial distribution with a logarithmic link function, as the negative binomial is a better fit for count data than the Poisson when there is over-dispersion, as it allows the variance to be different from the mean. The negative binomial distribution have been used in other studies for count data with over-dispersion, such as deaths (Haldar & Sethi, 2020).
Each observation in the model is weighted in the regression using the total number of cases per month, which helps to resolve the issue of differences in size between countries. An analysis of the distribution of the outcome variable and the appropriateness for a logarithmic transformation is provided in the Appendix section.
The major assumptions of the model are:
The main hypothesis to test is the effect of each vaccination level on the CFR for COVID-19, that is to test \(H_0:\beta_1=0\) vs \(H_A:\beta_1 \neq 0\). A likelihood ratio test is conduct to test whether the coefficient associated to vaccination status is significant. Furthermore, the coefficients are present along with their confidence intervals at 95% level. This model specification can only provide an answer regarding association of vaccination rates and CFR at country level, and does not provide an statement about causal inference nor efficacy of the vaccine at the individual level.
To interpret the results, the coefficients are presented as relative risk (RR). This represents the relative change in the case fatality rate for COVID-19. For example, a coefficient with a RR of 0.9, means that this variable reduces in 10% the mortality rate for COVID-19, while controlling by other variables in the model.
All tests on the residuals of the generalized linear model were conducted using a simulation-based approach using the DHARMa library in R (Hartig, 2022). The following test were conducted: dispersion of residuals and a simulated QQ-plot. Results from this test are presented in the appendix.
In addition to testing the main model, several scenarios were conducted to analyze results in more robust way:
# libraries
library(tidyverse) # data manipulation
library(scales)
library(lubridate) #time format
library(gridExtra) # plot manipulation
library(plotly) # interactive plots
library(skimr) # summary statistics
library(flextable) #table format for hhtml
library(DHARMa) # Simulations test for GLM
theme_set(theme_bw(16)+theme(panel.grid.major = element_blank()))
# COVID-19 Data -----
# load WHO COVID-19 data
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
# Data wrangling (preparation for analysis)
# shift deaths two weeks early
covid <- covid %>%
group_by(Country) %>%
mutate(deaths_shift=lead(New_deaths,14,
order_by = Date_reported)) %>%
ungroup()
# filter to remove March 2022 - Updated to August
covid <- covid %>% filter(Date_reported<"2022-08-01")
## Create time factor variable, splitting years in half
halfYear_levels <- c("2020-1","2020-2","2021-1",
"2021-2","2022-1","2022-2")
month_levels <- c("2020-12","2021-1","2021-2","2021-3",
"2021-4","2021-5","2021-6","2021-7","2021-8",
"2021-9","2021-10","2021-11","2021-12",
"2022-1","2022-2","2022-3","2022-4","2022-5",
"2022-6","2022-7")
covid <- covid %>%
mutate(period=case_when(
Date_reported < as.Date("2020-07-01") ~ "2020-1",
Date_reported < as.Date("2021-01-01") ~ "2020-2",
Date_reported < as.Date("2021-07-01") ~ "2021-1",
Date_reported < as.Date("2022-01-01") ~ "2021-2",
Date_reported < as.Date("2022-07-01") ~ "2022-1",
Date_reported < as.Date("2023-01-01") ~ "2022-2",
T ~ "other") %>% factor(levels=halfYear_levels),
month=paste(year(Date_reported),month(Date_reported),sep="-") %>%
factor(levels =month_levels))
# clean data: remove negative cases and deaths
covid <- covid %>%
mutate(New_deaths=deaths_shift) %>%
filter(New_cases>=0 & New_deaths>=0)
## need to summarize COVID data to periods of time: MONTHS
covid_period <- covid %>%
group_by(WHO_region,Country,month) %>%
summarise(cases=sum(New_cases),
deaths=sum(New_deaths)) %>% ungroup() %>%
filter(cases>99 & deaths>9) %>% # filter to remove unreliable periods
mutate(cfr=deaths/cases) # we calculate CFR for each given month
# VACCINATION DATA -----
## Load vaccination data
vaccination <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv")
# Filter to remove March 2022 - Updated to August 2022
vaccination <- vaccination %>%
filter(date<"2022-08-01")
# We summarized vaccination data based on the a time period: MONTH
## Create time factor variable, splitting years in half
vaccination <- vaccination %>%
mutate(period=case_when(
date < as.Date("2020-07-01") ~ "2020-1",
date < as.Date("2021-01-01") ~ "2020-2",
date < as.Date("2021-07-01") ~ "2021-1",
date < as.Date("2022-01-01") ~ "2021-2",
date < as.Date("2022-07-01") ~ "2022-1",
date < as.Date("2023-01-01") ~ "2022-2",
T ~ "other") %>% factor(levels=halfYear_levels),
month=paste(year(date),month(date),sep="-") %>% factor(levels=month_levels))
# take the average over the period: NA is 0
vaccination_period <- vaccination %>%
mutate(vaccinated=replace_na(people_fully_vaccinated_per_hundred,0)) %>%
group_by(location,month) %>%
summarise(vaccinated=mean(vaccinated,na.rm=T)) %>% ungroup()
# JOIN DATA -----
## Change countries name to increase matching (manually, based on visual inspection)
covid <- covid %>%
mutate(Country=case_when(
Country=="Bolivia (Plurinational State of)" ~ "Bolivia",
Country=="Bonaire" ~ "Bonaire Sint Eustatius and Saba",
Country=="Brunei Darussalam" ~ "Brunei",
Country=="Cabo Verde" ~ "Cape Verde",
Country=="Côte d’Ivoire" ~ "Cote d'Ivoire",
# str_detect(Country,"C.te d.Ivoire") ~ "Cote d'Ivoire",
Country=="Curaçao" ~ "Curacao",
# str_detect(Country,"Cura.ao") ~ "Curacao",
Country=="Democratic Republic of the Congo" ~ "Democratic Republic of Congo",
Country=="Falkland Islands (Malvinas)" ~ "Falkland Islands",
Country=="Faroe Islands" ~ "Faeroe Islands",
Country=="Iran (Islamic Republic of)" ~ "Iran",
Country=="Kosovo[1]" ~ "Kosovo",
Country=="Lao People's Democratic Republic" ~ "Laos",
Country=="occupied Palestinian territory, including east Jerusalem" ~ "Palestine",
Country=="Pitcairn Islands" ~ "Pitcairn",
Country=="Republic of Korea" ~ "South Korea",
Country=="Republic of Moldova" ~ "Moldova",
Country=="Russian Federation" ~ "Russia",
Country=="Sint Maarten" ~ "Sint Maarten (Dutch part)",
Country=="Syrian Arab Republic" ~ "Syria",
Country=="The United Kingdom" ~ "United Kingdom",
Country=="Timor-Leste" ~ "Timor",
Country=="United Republic of Tanzania" ~ "Tanzania",
Country=="United States of America" ~ "United States",
Country=="Venezuela (Bolivarian Republic of)" ~ "Venezuela",
Country=="Viet Nam" ~ "Vietnam",
T ~ Country))
# Join countries from both data sources
# countries_covid <- unique(covid$Country)
# countries_vaccine <- unique(vaccination$location)
#
# countries_vaccine <- data.frame(countries_vaccine=countries_vaccine,
# x=1)
# countries_covid <- data.frame(countries_covid=countries_covid,
# y=1)
# are the codes similar?
# countries_vaccine %>%
# left_join(countries_covid,by=c("countries_vaccine"="countries_covid")) %>%
# pull(y) %>% sum(na.rm=T)
# 215 countries match
# join based on location - I join it to the Vaccination, so I can get months with vaccine information for each country
df <- vaccination_period %>%
left_join(covid_period, by=c("location"="Country",
"month")) %>%
filter(cases>=0 & deaths>=0 & !(is.na(vaccinated)))
## More convenient labels for region
df <- df %>%
mutate(region_name=case_when(
WHO_region=="EMRO" ~ "Eastern Mediterranean",
WHO_region=="EURO" ~ "Europe",
WHO_region=="AFRO" ~ "Africa",
WHO_region=="AMRO" ~ "Americas",
WHO_region=="WPRO" ~ "Western Pacific",
WHO_region=="SEARO" ~ "South-East Asia") %>%
factor(levels=c("Americas","Europe","Eastern Mediterranean",
"Western Pacific","South-East Asia","Africa")),
location=as.factor(location))
# Vaccination level factor variable
df <- df %>%
mutate(vaccinated_level=case_when(
vaccinated<25 ~ "0%-25%",
vaccinated<50 ~ "25%-50%",
vaccinated<75 ~ "50%-75%",
vaccinated<100 ~ "75%-100%",
T ~"other") %>% factor(levels=c("0%-25%","25%-50%",
"50%-75%","75%-100%")),
vaccinated_level_10=case_when(
vaccinated<10 ~ "0%-10%",
vaccinated<20 ~ "10%-20%",
vaccinated<30 ~ "20%-30%",
vaccinated<40 ~ "30%-40%",
vaccinated<50 ~ "40%-50%",
vaccinated<60 ~ "50%-60%",
vaccinated<70 ~ "60%-70%",
vaccinated<80 ~ "70%-80%",
vaccinated<90 ~ "80%-90%",
vaccinated<101 ~ "90%-100%",
T ~"other") %>% factor(levels=c("0%-10%","10%-20%","20%-30%",
"30%-40%","40%-50%","50%-60%",
"60%-70%","70%-80%","80%-90%",
"90%-100%")))
# df$vaccinated_level_10 %>% table()
Let’s provide some summary statistics for the COVID-19 cases and deaths data.
# function to estimate summary statistics:
f.sum <- function(x){
return(
data.frame(
Mean=mean(x,na.rm=T),
sd=sd(x,na.rm=T),
Min=min(x,na.rm=T),
Median=quantile(x,0.5,na.rm=T),
Max=max(x,na.rm = T)
)
)
}
# average number of months per county:
avg_country_month <- df %>%
group_by(location) %>% tally()
# Table with summary statistics
rbind(`Number of months per country`=f.sum(avg_country_month$n),
`COVID-19 Cases per Month`=f.sum(df$cases),
`COVID-19 Deaths per Month`=f.sum(df$deaths),
`Case Fatality Rate (CFR) %`=f.sum(df$cfr*100),
`People Fully Vaccinated (%)`=f.sum(df$vaccinated)) %>%
rownames_to_column() %>% rename(Metric=rowname) %>%
flextable() %>% autofit() %>%
colformat_double(digits=2) %>%
colformat_double(i=2:3,digits=0) %>%
bold(part = "header") %>%
set_caption(paste0(
"Summary Metrics: Monthly observations per country. n = ",nrow(df))) %>%
footnote(i=1,j=1,
as_paragraph(paste0(length(unique(df$location)),
" Countries considered")))
Metric | Mean | sd | Min | Median | Max |
Number of months per country1 | 12.49 | 6.36 | 1.00 | 13.00 | 20.00 |
COVID-19 Cases per Month | 155,878 | 528,778 | 101 | 23,922 | 9,284,558 |
COVID-19 Deaths per Month | 1,452 | 5,368 | 10 | 222 | 111,988 |
Case Fatality Rate (CFR) % | 1.89 | 3.04 | 0.02 | 1.12 | 76.01 |
People Fully Vaccinated (%) | 20.42 | 26.65 | 0.00 | 6.59 | 91.58 |
1168 Countries considered | |||||
rm(avg_country_month)
We observe:
#### Time series of COVID-19 Cases and deaths
p_region <- df %>%
pivot_longer(c(cases,deaths), names_to = "key", values_to = "value") %>%
ggplot(aes(month,value))+
# geom_line()+
geom_boxplot()+
facet_wrap(~key,ncol=1,scales = "free_y")+
labs(x="Date",y="")+
scale_y_continuous(labels=function(x) format(x,
big.mark = " ",
scientific = FALSE),
trans = "log10")
# p_region
p_cfr_time <- df %>%
ggplot(aes(month,cfr))+
geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1))+
labs(x="Date",y="Case Fatality Rate COVID-19 [%]")+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_cfr_time
p_vacc_time <- df %>%
ggplot(aes(month,vaccinated))+
geom_boxplot()+
labs(x="Date",y="People Fully Vaccinated per 100 people")+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_vacc_time
## combine graphs
# grid.arrange(p_cfr_time,p_vacc_time,ncol=1)
Our main outcome of interest is the case fatality rate for COVID-19. Let’s look at the total case fatality rate for COVID-19 for the entire pandemic period:
LOS PEAKS POR REGION SE PODRAN EXPLICAR POR PEAKS DE PAISES?
CURVA VERDE TIENE UN PATRON INTERESANTE
p_cfr_all<- df %>%
filter(WHO_region!="Other") %>%
ggplot(aes(region_name,cfr))+
geom_jitter(alpha=0.5,aes(text=paste0("Location: ",location,
"\n Month: ",month)))+
geom_boxplot(alpha=0.5,outlier.alpha = 0)+
coord_flip()+
scale_y_continuous(labels = scales::percent, limits=c(0,0.1))+
labs(x="WHO Region",y="Case Fatality Rate COVID-19 [%]",
caption = "Cumulated cases and deaths for the period December 2020 to July 2022 \n Only CFR below 10% are shown")
ggplotly(p_cfr_all)
# percentage above 10%
# sum(nrow(filter(df,cfr>0.1)))/nrow(df)*100
We can see that the case fatality rate is around 2% for the majority of the observations gathered, that correspond at monthly rates for each country. We observe variations across each country and across WHO regions, with Europe and Western Pacific with the lowest CFR.
Let’s look at the case fatality rate over time:
# Time series CFR - NEED TO CONSIDER WEIGHTED AVERAGE
p_cfr_time <- df %>%
group_by(month,region_name) %>%
summarize(cfr=mean(cfr,na.rm=T)) %>%
ggplot(aes(month,cfr,
col=region_name,
group=region_name
))+
geom_line(size=1.5)+
# geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1),
# breaks=c(0,0.05,0.1)
)+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
labs(x="",y="Case Fatality Rate COVID-19 [%]",col="WHO Region")
ggplotly(p_cfr_time)
We observe a higher value in the CFR in the early times of the pandemic, and an important reduction in the most recent times. We also observe important peaks and differences across WHO regions, probably due to different variations of the COVID-19 virus.
The next figure shows the vaccination status for each country at July 2022. We observe important differences in the access to vaccines across WHO regions, and within countries in the same region (each point in the figure represents a country). For example, Europe present a high vaccination rate, but we observe important differences in the vaccination level within the countries of Europe. The distribution of vaccination levels seems to be two-modal, with some countries with high vaccination and some with really low.
Further analysis is included in the appendix session, which shows a clear bi-modal distribution in every WHO region regarding vaccination level, except for Africa that has only low vaccinated countries.
p_vacc_all <- df %>%
filter(month=="2022-7") %>% #July 2022
ggplot(aes(region_name,vaccinated))+
geom_boxplot(alpha=0.5,outlier.alpha = 0)+
geom_jitter(alpha=0.5, size=2,
aes(text=paste0("Location: ",location)))+
coord_flip()+
labs(x="WHO Region",y="People Fully Vaccinated [%]",
caption = "Vaccination status at July 2022")
ggplotly(p_vacc_all)
# time series of vaccination - NEED TO CONSIDER WEIGHTED AVERAGE
# df %>%
# group_by(month,region_name) %>%
# summarize(vaccinated=mean(vaccinated,na.rm=T)) %>%
# ggplot(aes(month,vaccinated,
# col=region_name,
# group=region_name
# ))+
# geom_line()+
# # geom_boxplot()+
# theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
# labs(x="",y="People Fully Vaccinated per 100 people",col="WHO Region")
Let’s make a comparison between the percentage of vaccinated people and the monthly CFR for each country. This comparison directly relates to our question of interest. The following boxplot shows the case fatality rate over different levels of vaccination, divided by WHO regions:
PQ SE OBSERVE UN EFECTO EN DISTINTA DIRECCION????
PUEDE SER RUIDO, EFECTO BORDE CATEGORIA, COMO TRATAMOS LOS OUTLIERS
EXPLICAR MEJOR EL GRAFICO, QE ES CADA COSA. QUIZAS UN JITTER GRAPH ES MEJOR!
p_vac_cfr_box <- df %>%
filter(!(is.na(WHO_region))) %>%
ggplot(aes(region_name,cfr,col=vaccinated_level))+
geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1)
)+
coord_flip()+
# theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
labs(col="Percentage of People \n Fully Vaccinated [%]",
x="WHO Region",
y="Case Fatality Rate COVID-19 [%]",
caption="Each point represents a country summary statistic for each month")
# ggplotly(p_vac_cfr_box)
p_vac_cfr_box
We observe for each region a clear reduction in the CFR for COVID-19 associated with higher vaccination levels. This reduction occur both in the average value and in the variation across countries in each month. This reduction occurs in higher intensity with vaccination rates above 50%. We also observe that some regions are left unvaccinated, like Africa.
This figure seems to support our main hypothesis of reduction of CFR in countries with higher vaccination rates. Let’s construct statistical model to test this hypothesis formally.
The following table shows the fitted coefficients for the vaccination level, along with the confidence intervals at 95% level. Diagnostics of this model are presented in the appendix section, but overall they don’t support the evidence of over-dispersion, but there is some presence of outliers in the model.
We observe that countries with vaccination rates between 25%-50% have a 19% reduction in the CFR for COVID-19, compared to countries in the 0%-25% vaccination group. This reduction increases with higher vaccination rates, as the reduction for the group 50%-75% is 27% and for the group 75%-100% is 37%.
# Model
df_model <- df %>%
select(deaths,cfr,cases,vaccinated,vaccinated_level,
vaccinated_level_10,location,month) %>%
mutate(location_month=paste0(month,location)) %>%
na.omit()
# Model with categorical vaccinated level
model.nb2 <- MASS::glm.nb(deaths~vaccinated_level+
location+month+offset(log(cases)),
data = df_model)
# summary(model.nb2)
point_vac <- exp(model.nb2$coefficients[2:4])
ci_vac <- exp(confint(model.nb2,2:4, level=0.95))
# LR test
model.nb1 <- MASS::glm.nb(deaths~location+month+offset(log(cases)),
data = df_model)
test_lr <- lmtest::lrtest(model.nb1,model.nb2)
estimates <- data.frame(mid=point_vac,
lower_bound=ci_vac[,1],
upper_bound=ci_vac[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,"vaccinated_level",
"Vaccination Level: "))
estimates %>%
rename(RR=`Estimated Coefficient`) %>%
flextable() %>%
autofit() %>%
colformat_double(digits=2) %>%
set_caption("Relative Risk (RR): Vaccination Level")
Variable | RR | Lower Bound C.I. 95% | Upper Bound C.I. 95% |
Vaccination Level: 25%-50% | 0.83 | 0.75 | 0.92 |
Vaccination Level: 50%-75% | 0.86 | 0.76 | 0.97 |
Vaccination Level: 75%-100% | 0.63 | 0.54 | 0.74 |
estimates %>%
mutate(Variable=str_remove(Variable,"Vaccination Level: ")) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccination Level",y="Relative Risk (RR): Vaccination Level")
rm(estimates)
# Function to plot estimated coefficients along their error bar
f_plot_coef <- function(mod,
range=2:4,
var_name="vaccinated_level",
var_title="Vaccination Level",
title_plot=""){
point <- exp(mod$coefficients[range])
ci <- exp(confint(mod,range, level=0.95))
estimates <- data.frame(mid=point,
lower_bound=ci[,1],
upper_bound=ci[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,var_name,
paste0(var_title,": ")))
# Plot
estimates %>%
mutate(Variable=str_remove(Variable,paste0(var_title,": "))) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x=var_title, y=paste0("Relative Risk (RR): ",
var_title),title = title_plot)
}
The estimated coefficients represent the relative change with respect to the reference case (Vaccination level between 0% and 25%). We can see important reduction associated with higher vaccination levels, meaning that countries with higher vaccination levels have a lower CFR, an it is an increasing relationship. Also, as the C.I. don’t cross the reference line at 1, we can reject the null hypothesis of no effect of vaccination level on CFR at 95% level. The likelihood ratio test also rejected the null hypothesis of no vaccination level effect with a p-value below \(1.45*10^{-8}\).
Let’s present also a summary of the other fitted coefficients estimated for each country and for each month:
location_coefficients <- data.frame(coef=exp(model.nb2$coefficients)) %>%
rownames_to_column() %>%
filter(str_detect(rowname,"location")) %>%
mutate(location=str_remove(rowname,"location"))
# Plot with WHO region
country_region <- df %>%
group_by(region_name,location) %>% tally() %>% ungroup()
location_coefficients %>%
left_join(country_region) %>%
ggplot(aes(region_name,coef,col=region_name))+
geom_jitter(alpha=.5)+
geom_boxplot()+
geom_hline(yintercept = 1, linetype="dashed")+
coord_flip()+
guides(col=F)+
labs(x="WHO Region",col="",y="Relative Risk (RR): Country")
size_coef <- length(model.nb2$coefficients)
# get coefficients for month
point_month <- exp(model.nb2$coefficients[(size_coef-13):(size_coef)])
ci_month <- exp(confint(model.nb2,(size_coef-13):(size_coef), level=0.95))
estimates_month <- data.frame(mid=point_month,
lower_bound=ci_month[,1],
upper_bound=ci_month[,2])
names(estimates_month) <- c("Relative Risk (RR)",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates_month %>%
rownames_to_column() %>%
mutate(rowname=str_remove(rowname,"month"),
month=factor(rowname, levels=rev(month_levels))) %>%
ggplot(aes(month, `Relative Risk (RR)`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="",y="Relative Risk (RR): Month. Reference variable: 2020-12")
Overall we observe some interesting patterns:
# OTHER MODELS ----
## Negative Binomial model - NUMERIC
model.nb <- MASS::glm.nb(deaths~vaccinated+location+month+offset(log(cases)),
data = df_model)
# summary(model.nb)
# exp(model.nb$coefficients[2]*10)
# exp(confint(model.nb,"vaccinated", level=0.95)*10)
The model with vaccination level as numeric variable shows that a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%; -4.61%], a similar result that the one obtained in Liang et al. (2021): -7.6% [C.I. 95%: -12.6%; -2.7%].
Next is presented the model with 10 categories of vaccination level and the model only considering cases where the CFR was below 10%, to remove the presence of outliers.
## Poisson model
model_poisson <- glm(deaths~vaccinated_level+location+month+offset(log(cases)),
data = df_model,
family = "poisson")
# summary(model_poisson)
# vaccinated variable
# exp(model_poisson$coefficients[2:4])
# exp(confint(model_poisson,2:4, level=0.95))
p_poisson <- f_plot_coef(model_poisson,
title_plot = "Outcome as Poisson Distribution")
## 10 levels categorical model ------
model.10levels <- MASS::glm.nb(deaths~vaccinated_level_10+
location+month+offset(log(cases)),
data = df_model)
# exp(model.10levels$coefficients[2:10])
# exp(confint(model.10levels,2:4, level=0.95))
p_10 <- f_plot_coef(model.10levels,2:10,
var_name = "vaccinated_level_10",
title_plot = "10 Categories for Vaccination Level")
# p_10
## without outliers (below 10% CFR) ------
model_noOutliers <- MASS::glm.nb(deaths~vaccinated_level+
location+month+offset(log(cases)),
data = filter(df_model,cfr<0.1))
# nobs(model_noOutliers)
# exp(model_noOutliers$coefficients[2:4])
# exp(confint(model_noOutliers,2:4, level=0.95))
p_noOut <- f_plot_coef(model_noOutliers,
title_plot = "Only CFR<10% considered")
# p_noOut
grid.arrange(p_10,p_noOut,nrow=1)
We can observe that the confidence interval differ among levels, due to the number of observations in each category. For example, not many countries have achieve a level above 90%, but even for this group the statistical evidence is strong enough to reject the null hypothesis of no effect of vaccines. We observe an expected result: the higher the vaccination rate the greater the reduction in the CFR for COVID-19.
For the model with removed observations, we got similar results as the original model, the greater the vaccination rate the lower the CFR for COVID-19. This model has narrow confidence interval, meaning that is more efficient in rejecting the null hypothesis of no effect.
Next are presented the relative risk of the mixed linear model, using country as a random effect. The confidence intervals for the RR of the original model estimated using bootstrap are also presented.
## country as random effects ------
library(lme4)
# takes forever to compute - option: load it
# model_random <- lme4::glmer.nb(deaths~(1|location)+vaccinated_level+
# month+offset(log(cases)),
# data=df_model)
# saveRDS(model_random,"randomModel.rds")
model_random <- read_rds("../randomModel.rds")
# nobs(model_random)
# ranef(model_random)
point <- exp(fixef(model_random)[2:4])
ci <- exp(confint(model_random,3:5,method="Wald", level=0.95))
estimates <- data.frame(mid=point,
lower_bound=ci[,1],
upper_bound=ci[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,"vaccinated_level",
paste0("Vaccinated Level",": ")))
# Plot
p_random <- estimates %>%
mutate(Variable=str_remove(Variable,paste0("Vaccinated Level",": "))) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
"Vaccinated Level"),
title = "Mixed Linear Effects Model (Country as Random)")
# Bootstrap
# Source: https://stackoverflow.com/questions/54749641/bootstrapping-with-glm-model
# it took really really long!
# UNCOMMENT TO GENERATE THE OBJECT AGAIN if not, load it
# data structure for results
# nboot <- 1000
# bres <- matrix(NA,
# nrow=nboot,
# ncol=4,
# dimnames=list(rep=seq(nboot),
# coef=names(coef(model.nb2))[1:4]))
# # bootstrap
# set.seed(101)
# bootsize <- nrow(df_model)
# df_boot <- df_model
# for (i in seq(nboot)) {
# bdat <- df_boot[sample(nrow(df_boot),size=bootsize,replace=TRUE),]
# bfit <- update(model.nb2, data=bdat) ## refit with new data
# bres[i,] <- coef(bfit)[1:4]
# }
# # output
# saveRDS(bres,"boot.rds")
bres <- readRDS ("../boot.rds")
boot_estimates <-
data.frame(mean_est=colMeans(exp(bres)),
t(apply(exp(bres),2,quantile,c(0.025,0.975))))
# plot
p_boot <- boot_estimates[2:4,] %>%
rownames_to_column() %>%
mutate(Variable=str_remove(rowname,"vaccinated_level")) %>%
ggplot(aes(Variable, mean_est))+
geom_point(col="red")+
geom_errorbar(aes(ymin=X2.5.,
ymax=X97.5.))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
"Vaccinated Level"),
title = "Bootstrap estimates of Original Model")
grid.arrange(p_random,p_boot,nrow=1)
We observe the same pattern as before, a higher vaccination rate is associate it with a lower CFR for COVID-19. We notice small differences in the estimates and in the confidence intervals, but in both cases we can reject the null hypothesis that the vaccination has no effect on the CFR for COVID-19.
This project seeks to study the effect of vaccination levels in reducing the CFR for COVID-19. The main findings are that higher vaccination rates are associated with lower CFR for COVID-19, making vaccination a good policy strategy to pursue to reduce the death toll of the pandemic. The obtained results are in line with previous studies conducted (Liang et al., 2021), and provide further evidence in the efficacy of vaccination in protecting human lives.
The model suffers from some inherent problems, that future studies could explore more in deep to improve the robustness of the results:
A statistical significant relationship was found, a higher vaccination rate reduce the relative risk of COVID-19. In comparison to the group with 0%-25% vaccinated people, countries in the 25%-50% group have a 19% [C.I. 95%: -27%;-11%] reduction in the CFR for COVID-19, countries in the 50%-75% group a 27% reduction [-35%;-18%] and countries in the 75%-100% group a 37% reduction [-48%;-24%]. Additionally, a a 10% increase in the percentage of vaccinated people results in a reduction of the COVID-19 CFR of 6.56% [C.I. 95%: -8.48%;-4.61%]. Potential limitations of the analysis are due to the potential presence of time and spatial autocorrelation in the data.
There is strong statistical evidence pointing in the association of countries with higher COVID-19 vaccination rates and lower case mortality rate due to COVID-19. This implies that each country with current lower vaccination rates should promote a vaccination strategy in this year to reduce the death burden of the on-going COVID-19 pandemic.
I acknowledge the instructor Professor Shizhe Chen and his materials for statistical methods of research. I also acknowledge the following classmates for their valuable feedback and comments during the project development: Yinan Cheng, Shuyu Guo, Kyung Jin Lee, Katherine Cheng, Oscar Rivera and Jedidiah Harwood.
The following figure shows the distribution of the CFR per WHO region, with no transformation and with a logarithmic transformation. We can see that applying a logarithmic transformation on the outcome variable help the distribution become more Normally distributed.
df %>%
mutate(log_cfr=log(cfr)) %>%
pivot_longer(c(cfr, log_cfr), names_to = "key", values_to = "value") %>%
mutate(key=str_replace(key,"log_cfr","Log(CFR)"),
key=str_replace(key,"cfr","CFR")) %>%
ggplot(aes(value,fill=region_name))+
geom_density(alpha=.5)+
facet_wrap(region_name~key,scales = "free")+
labs(y="Density",x="Case Fatality Rate COVID-19 [%]",
fill="Region Name",
caption = "Cumulated cases and deaths for the period December 2020 to July 2022")
The following figure shows the distribution of the vaccination status per country, showing that within each WHO region we observe a bi-modal distribution. Except for Africa region, there are two groups of countries: ones with high vaccination rate and ones with lower vaccination rate.
# density of vaccination
df %>%
filter(month=="2022-7") %>% #July 2022
ggplot(aes(vaccinated,fill=region_name))+
geom_density(alpha=.5)+
facet_wrap(~region_name,scales = "free")+
labs(y="Density",x="People Fully Vaccinated per 100 people",
fill="Region Name",
caption = "Vaccination status at July 2022")
# Negative Binomial Model - Vaccination as categorical variable
# par(mfrow = c(2, 2))
# plot(model.nb2)
# boot::glm.diag.plots(model.nb2)
Here is the conducted test on dispersion, to see whether the data fitted to the model is more/less dispersed than expected. The test tells us that cannot reject the null hypothesis of no presence of over/under dispersion in the model. This means that the dispersion of the model is in line for what is expected for a negative binomial distribution.
# Dispersion test
# tests if the simulated dispersion is equal to the observed dispersion
a <- testDispersion(model.nb2) #looks good
Next we conduct a QQ-plot for a generalized linear model:
plotQQunif(model.nb2)
We observe that the dispersion of the residuals of the model is as expected for a negative binomial distribution. The outlier test reveals the presence of potential observations that may have an unusual value, which could be problematic for the model. Based on this, an additional model removing observations with values of CFR above 10% was conducted, which gives similar results to the original model.
# testZeroinflation() - tests if there are more zeros in the data than expected from the simulations
# testZeroInflation(model.nb2)
# testTemporalAutocorrelation() - tests for temporal autocorrelation in the residuals
# simulationOutput <- simulateResiduals(fittedModel = model.nb2)
# testTemporalAutocorrelation(simulationOutput,
# df_model$location_month)
# testOutliers() - tests if there are more simulation outliers than expectedtestOutliers() - tests if there are more simulation outliers than expected
# a <- testOutliers(model.nb2)
# testOutliers(model_noOutliers)
# influence plot
# outliers to remove
# car::influencePlot(model.nb2)
#
#
# df_aux <- df_model[c(-615,-631,-1285,-1332),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-40,-456,-766,-822,-1508),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-327,-835,-859,-931,-1546),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-177,-627,-1040,-1274,-1355,-1439),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lme4_1.1-28 Matrix_1.4-0 DHARMa_0.4.5 flextable_0.7.0
## [5] skimr_2.1.3 plotly_4.10.0 gridExtra_2.3 lubridate_1.8.0
## [9] scales_1.1.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8
## [13] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6
## [17] ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-155 fs_1.5.2 bit64_4.0.5 httr_1.4.2
## [5] repr_1.1.4 tools_4.1.3 backports_1.4.1 bslib_0.3.1
## [9] utf8_1.2.2 R6_2.5.1 DBI_1.1.2 lazyeval_0.2.2
## [13] colorspace_2.0-3 withr_2.5.0 tidyselect_1.1.2 curl_4.3.2
## [17] bit_4.0.4 compiler_4.1.3 cli_3.2.0 rvest_1.0.2
## [21] xml2_1.3.3 officer_0.4.1 labeling_0.4.2 sass_0.4.0
## [25] lmtest_0.9-39 systemfonts_1.0.4 digest_0.6.27 minqa_1.2.4
## [29] rmarkdown_2.13 base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.2
## [33] highr_0.9 dbplyr_2.1.1 fastmap_1.1.0 htmlwidgets_1.5.4
## [37] rlang_1.0.2 readxl_1.3.1 rstudioapi_0.13 farver_2.1.0
## [41] jquerylib_0.1.4 generics_0.1.2 zoo_1.8-9 jsonlite_1.8.0
## [45] crosstalk_1.2.0 vroom_1.5.7 zip_2.2.0 magrittr_2.0.1
## [49] Rcpp_1.0.8.3 munsell_0.5.0 fansi_1.0.2 gdtools_0.2.4
## [53] lifecycle_1.0.1 stringi_1.7.6 yaml_2.3.5 MASS_7.3-55
## [57] grid_4.1.3 parallel_4.1.3 crayon_1.5.0 lattice_0.20-45
## [61] haven_2.4.3 splines_4.1.3 hms_1.1.1 knitr_1.37
## [65] pillar_1.7.0 uuid_1.0-4 boot_1.3-28 reprex_2.0.1
## [69] glue_1.6.2 evaluate_0.15 data.table_1.14.2 modelr_0.1.8
## [73] vctrs_0.3.8 nloptr_2.0.0 tzdb_0.2.0 cellranger_1.1.0
## [77] gtable_0.3.0 assertthat_0.2.1 xfun_0.30 broom_0.7.12
## [81] viridisLite_0.4.0 ellipsis_0.3.2 gap_1.2.3-1
# knitr::opts_chunk$set(fig.pos = 'H')
# knitr::opts_chunk$set(fig.pos = 'H', message=F, echo = T, warning=F)
knitr::opts_chunk$set(fig.pos = 'H',
fig.width=12, fig.height=8,
message=F, echo = T, warning=F)
# libraries
library(tidyverse) # data manipulation
library(scales)
library(lubridate) #time format
library(gridExtra) # plot manipulation
library(plotly) # interactive plots
library(skimr) # summary statistics
library(flextable) #table format for hhtml
library(DHARMa) # Simulations test for GLM
theme_set(theme_bw(16)+theme(panel.grid.major = element_blank()))
# COVID-19 Data -----
# load WHO COVID-19 data
covid <- read_csv("https://covid19.who.int/WHO-COVID-19-global-data.csv")
# Data wrangling (preparation for analysis)
# shift deaths two weeks early
covid <- covid %>%
group_by(Country) %>%
mutate(deaths_shift=lead(New_deaths,14,
order_by = Date_reported)) %>%
ungroup()
# filter to remove March 2022 - Updated to August
covid <- covid %>% filter(Date_reported<"2022-08-01")
## Create time factor variable, splitting years in half
halfYear_levels <- c("2020-1","2020-2","2021-1",
"2021-2","2022-1","2022-2")
month_levels <- c("2020-12","2021-1","2021-2","2021-3",
"2021-4","2021-5","2021-6","2021-7","2021-8",
"2021-9","2021-10","2021-11","2021-12",
"2022-1","2022-2","2022-3","2022-4","2022-5",
"2022-6","2022-7")
covid <- covid %>%
mutate(period=case_when(
Date_reported < as.Date("2020-07-01") ~ "2020-1",
Date_reported < as.Date("2021-01-01") ~ "2020-2",
Date_reported < as.Date("2021-07-01") ~ "2021-1",
Date_reported < as.Date("2022-01-01") ~ "2021-2",
Date_reported < as.Date("2022-07-01") ~ "2022-1",
Date_reported < as.Date("2023-01-01") ~ "2022-2",
T ~ "other") %>% factor(levels=halfYear_levels),
month=paste(year(Date_reported),month(Date_reported),sep="-") %>%
factor(levels =month_levels))
# clean data: remove negative cases and deaths
covid <- covid %>%
mutate(New_deaths=deaths_shift) %>%
filter(New_cases>=0 & New_deaths>=0)
## need to summarize COVID data to periods of time: MONTHS
covid_period <- covid %>%
group_by(WHO_region,Country,month) %>%
summarise(cases=sum(New_cases),
deaths=sum(New_deaths)) %>% ungroup() %>%
filter(cases>99 & deaths>9) %>% # filter to remove unreliable periods
mutate(cfr=deaths/cases) # we calculate CFR for each given month
# VACCINATION DATA -----
## Load vaccination data
vaccination <- read_csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv")
# Filter to remove March 2022 - Updated to August 2022
vaccination <- vaccination %>%
filter(date<"2022-08-01")
# We summarized vaccination data based on the a time period: MONTH
## Create time factor variable, splitting years in half
vaccination <- vaccination %>%
mutate(period=case_when(
date < as.Date("2020-07-01") ~ "2020-1",
date < as.Date("2021-01-01") ~ "2020-2",
date < as.Date("2021-07-01") ~ "2021-1",
date < as.Date("2022-01-01") ~ "2021-2",
date < as.Date("2022-07-01") ~ "2022-1",
date < as.Date("2023-01-01") ~ "2022-2",
T ~ "other") %>% factor(levels=halfYear_levels),
month=paste(year(date),month(date),sep="-") %>% factor(levels=month_levels))
# take the average over the period: NA is 0
vaccination_period <- vaccination %>%
mutate(vaccinated=replace_na(people_fully_vaccinated_per_hundred,0)) %>%
group_by(location,month) %>%
summarise(vaccinated=mean(vaccinated,na.rm=T)) %>% ungroup()
# JOIN DATA -----
## Change countries name to increase matching (manually, based on visual inspection)
covid <- covid %>%
mutate(Country=case_when(
Country=="Bolivia (Plurinational State of)" ~ "Bolivia",
Country=="Bonaire" ~ "Bonaire Sint Eustatius and Saba",
Country=="Brunei Darussalam" ~ "Brunei",
Country=="Cabo Verde" ~ "Cape Verde",
Country=="Côte d’Ivoire" ~ "Cote d'Ivoire",
# str_detect(Country,"C.te d.Ivoire") ~ "Cote d'Ivoire",
Country=="Curaçao" ~ "Curacao",
# str_detect(Country,"Cura.ao") ~ "Curacao",
Country=="Democratic Republic of the Congo" ~ "Democratic Republic of Congo",
Country=="Falkland Islands (Malvinas)" ~ "Falkland Islands",
Country=="Faroe Islands" ~ "Faeroe Islands",
Country=="Iran (Islamic Republic of)" ~ "Iran",
Country=="Kosovo[1]" ~ "Kosovo",
Country=="Lao People's Democratic Republic" ~ "Laos",
Country=="occupied Palestinian territory, including east Jerusalem" ~ "Palestine",
Country=="Pitcairn Islands" ~ "Pitcairn",
Country=="Republic of Korea" ~ "South Korea",
Country=="Republic of Moldova" ~ "Moldova",
Country=="Russian Federation" ~ "Russia",
Country=="Sint Maarten" ~ "Sint Maarten (Dutch part)",
Country=="Syrian Arab Republic" ~ "Syria",
Country=="The United Kingdom" ~ "United Kingdom",
Country=="Timor-Leste" ~ "Timor",
Country=="United Republic of Tanzania" ~ "Tanzania",
Country=="United States of America" ~ "United States",
Country=="Venezuela (Bolivarian Republic of)" ~ "Venezuela",
Country=="Viet Nam" ~ "Vietnam",
T ~ Country))
# Join countries from both data sources
# countries_covid <- unique(covid$Country)
# countries_vaccine <- unique(vaccination$location)
#
# countries_vaccine <- data.frame(countries_vaccine=countries_vaccine,
# x=1)
# countries_covid <- data.frame(countries_covid=countries_covid,
# y=1)
# are the codes similar?
# countries_vaccine %>%
# left_join(countries_covid,by=c("countries_vaccine"="countries_covid")) %>%
# pull(y) %>% sum(na.rm=T)
# 215 countries match
# join based on location - I join it to the Vaccination, so I can get months with vaccine information for each country
df <- vaccination_period %>%
left_join(covid_period, by=c("location"="Country",
"month")) %>%
filter(cases>=0 & deaths>=0 & !(is.na(vaccinated)))
## More convenient labels for region
df <- df %>%
mutate(region_name=case_when(
WHO_region=="EMRO" ~ "Eastern Mediterranean",
WHO_region=="EURO" ~ "Europe",
WHO_region=="AFRO" ~ "Africa",
WHO_region=="AMRO" ~ "Americas",
WHO_region=="WPRO" ~ "Western Pacific",
WHO_region=="SEARO" ~ "South-East Asia") %>%
factor(levels=c("Americas","Europe","Eastern Mediterranean",
"Western Pacific","South-East Asia","Africa")),
location=as.factor(location))
# Vaccination level factor variable
df <- df %>%
mutate(vaccinated_level=case_when(
vaccinated<25 ~ "0%-25%",
vaccinated<50 ~ "25%-50%",
vaccinated<75 ~ "50%-75%",
vaccinated<100 ~ "75%-100%",
T ~"other") %>% factor(levels=c("0%-25%","25%-50%",
"50%-75%","75%-100%")),
vaccinated_level_10=case_when(
vaccinated<10 ~ "0%-10%",
vaccinated<20 ~ "10%-20%",
vaccinated<30 ~ "20%-30%",
vaccinated<40 ~ "30%-40%",
vaccinated<50 ~ "40%-50%",
vaccinated<60 ~ "50%-60%",
vaccinated<70 ~ "60%-70%",
vaccinated<80 ~ "70%-80%",
vaccinated<90 ~ "80%-90%",
vaccinated<101 ~ "90%-100%",
T ~"other") %>% factor(levels=c("0%-10%","10%-20%","20%-30%",
"30%-40%","40%-50%","50%-60%",
"60%-70%","70%-80%","80%-90%",
"90%-100%")))
# df$vaccinated_level_10 %>% table()
# function to estimate summary statistics:
f.sum <- function(x){
return(
data.frame(
Mean=mean(x,na.rm=T),
sd=sd(x,na.rm=T),
Min=min(x,na.rm=T),
Median=quantile(x,0.5,na.rm=T),
Max=max(x,na.rm = T)
)
)
}
# average number of months per county:
avg_country_month <- df %>%
group_by(location) %>% tally()
# Table with summary statistics
rbind(`Number of months per country`=f.sum(avg_country_month$n),
`COVID-19 Cases per Month`=f.sum(df$cases),
`COVID-19 Deaths per Month`=f.sum(df$deaths),
`Case Fatality Rate (CFR) %`=f.sum(df$cfr*100),
`People Fully Vaccinated (%)`=f.sum(df$vaccinated)) %>%
rownames_to_column() %>% rename(Metric=rowname) %>%
flextable() %>% autofit() %>%
colformat_double(digits=2) %>%
colformat_double(i=2:3,digits=0) %>%
bold(part = "header") %>%
set_caption(paste0(
"Summary Metrics: Monthly observations per country. n = ",nrow(df))) %>%
footnote(i=1,j=1,
as_paragraph(paste0(length(unique(df$location)),
" Countries considered")))
rm(avg_country_month)
#### Time series of COVID-19 Cases and deaths
p_region <- df %>%
pivot_longer(c(cases,deaths), names_to = "key", values_to = "value") %>%
ggplot(aes(month,value))+
# geom_line()+
geom_boxplot()+
facet_wrap(~key,ncol=1,scales = "free_y")+
labs(x="Date",y="")+
scale_y_continuous(labels=function(x) format(x,
big.mark = " ",
scientific = FALSE),
trans = "log10")
# p_region
p_cfr_time <- df %>%
ggplot(aes(month,cfr))+
geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1))+
labs(x="Date",y="Case Fatality Rate COVID-19 [%]")+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_cfr_time
p_vacc_time <- df %>%
ggplot(aes(month,vaccinated))+
geom_boxplot()+
labs(x="Date",y="People Fully Vaccinated per 100 people")+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))
# p_vacc_time
## combine graphs
# grid.arrange(p_cfr_time,p_vacc_time,ncol=1)
p_cfr_all<- df %>%
filter(WHO_region!="Other") %>%
ggplot(aes(region_name,cfr))+
geom_jitter(alpha=0.5,aes(text=paste0("Location: ",location,
"\n Month: ",month)))+
geom_boxplot(alpha=0.5,outlier.alpha = 0)+
coord_flip()+
scale_y_continuous(labels = scales::percent, limits=c(0,0.1))+
labs(x="WHO Region",y="Case Fatality Rate COVID-19 [%]",
caption = "Cumulated cases and deaths for the period December 2020 to July 2022 \n Only CFR below 10% are shown")
ggplotly(p_cfr_all)
# percentage above 10%
# sum(nrow(filter(df,cfr>0.1)))/nrow(df)*100
# Time series CFR - NEED TO CONSIDER WEIGHTED AVERAGE
p_cfr_time <- df %>%
group_by(month,region_name) %>%
summarize(cfr=mean(cfr,na.rm=T)) %>%
ggplot(aes(month,cfr,
col=region_name,
group=region_name
))+
geom_line(size=1.5)+
# geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1),
# breaks=c(0,0.05,0.1)
)+
theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
labs(x="",y="Case Fatality Rate COVID-19 [%]",col="WHO Region")
ggplotly(p_cfr_time)
p_vacc_all <- df %>%
filter(month=="2022-7") %>% #July 2022
ggplot(aes(region_name,vaccinated))+
geom_boxplot(alpha=0.5,outlier.alpha = 0)+
geom_jitter(alpha=0.5, size=2,
aes(text=paste0("Location: ",location)))+
coord_flip()+
labs(x="WHO Region",y="People Fully Vaccinated [%]",
caption = "Vaccination status at July 2022")
ggplotly(p_vacc_all)
# time series of vaccination - NEED TO CONSIDER WEIGHTED AVERAGE
# df %>%
# group_by(month,region_name) %>%
# summarize(vaccinated=mean(vaccinated,na.rm=T)) %>%
# ggplot(aes(month,vaccinated,
# col=region_name,
# group=region_name
# ))+
# geom_line()+
# # geom_boxplot()+
# theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
# labs(x="",y="People Fully Vaccinated per 100 people",col="WHO Region")
p_vac_cfr_box <- df %>%
filter(!(is.na(WHO_region))) %>%
ggplot(aes(region_name,cfr,col=vaccinated_level))+
geom_boxplot()+
scale_y_continuous(labels = scales::percent,
limits = c(0,0.1)
)+
coord_flip()+
# theme(axis.text.x = element_text(angle =-45,vjust=-0.5))+
labs(col="Percentage of People \n Fully Vaccinated [%]",
x="WHO Region",
y="Case Fatality Rate COVID-19 [%]",
caption="Each point represents a country summary statistic for each month")
# ggplotly(p_vac_cfr_box)
p_vac_cfr_box
# Model
df_model <- df %>%
select(deaths,cfr,cases,vaccinated,vaccinated_level,
vaccinated_level_10,location,month) %>%
mutate(location_month=paste0(month,location)) %>%
na.omit()
# Model with categorical vaccinated level
model.nb2 <- MASS::glm.nb(deaths~vaccinated_level+
location+month+offset(log(cases)),
data = df_model)
# summary(model.nb2)
point_vac <- exp(model.nb2$coefficients[2:4])
ci_vac <- exp(confint(model.nb2,2:4, level=0.95))
# LR test
model.nb1 <- MASS::glm.nb(deaths~location+month+offset(log(cases)),
data = df_model)
test_lr <- lmtest::lrtest(model.nb1,model.nb2)
estimates <- data.frame(mid=point_vac,
lower_bound=ci_vac[,1],
upper_bound=ci_vac[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,"vaccinated_level",
"Vaccination Level: "))
estimates %>%
rename(RR=`Estimated Coefficient`) %>%
flextable() %>%
autofit() %>%
colformat_double(digits=2) %>%
set_caption("Relative Risk (RR): Vaccination Level")
estimates %>%
mutate(Variable=str_remove(Variable,"Vaccination Level: ")) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccination Level",y="Relative Risk (RR): Vaccination Level")
rm(estimates)
# Function to plot estimated coefficients along their error bar
f_plot_coef <- function(mod,
range=2:4,
var_name="vaccinated_level",
var_title="Vaccination Level",
title_plot=""){
point <- exp(mod$coefficients[range])
ci <- exp(confint(mod,range, level=0.95))
estimates <- data.frame(mid=point,
lower_bound=ci[,1],
upper_bound=ci[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,var_name,
paste0(var_title,": ")))
# Plot
estimates %>%
mutate(Variable=str_remove(Variable,paste0(var_title,": "))) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x=var_title, y=paste0("Relative Risk (RR): ",
var_title),title = title_plot)
}
location_coefficients <- data.frame(coef=exp(model.nb2$coefficients)) %>%
rownames_to_column() %>%
filter(str_detect(rowname,"location")) %>%
mutate(location=str_remove(rowname,"location"))
# Plot with WHO region
country_region <- df %>%
group_by(region_name,location) %>% tally() %>% ungroup()
location_coefficients %>%
left_join(country_region) %>%
ggplot(aes(region_name,coef,col=region_name))+
geom_jitter(alpha=.5)+
geom_boxplot()+
geom_hline(yintercept = 1, linetype="dashed")+
coord_flip()+
guides(col=F)+
labs(x="WHO Region",col="",y="Relative Risk (RR): Country")
size_coef <- length(model.nb2$coefficients)
# get coefficients for month
point_month <- exp(model.nb2$coefficients[(size_coef-13):(size_coef)])
ci_month <- exp(confint(model.nb2,(size_coef-13):(size_coef), level=0.95))
estimates_month <- data.frame(mid=point_month,
lower_bound=ci_month[,1],
upper_bound=ci_month[,2])
names(estimates_month) <- c("Relative Risk (RR)",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates_month %>%
rownames_to_column() %>%
mutate(rowname=str_remove(rowname,"month"),
month=factor(rowname, levels=rev(month_levels))) %>%
ggplot(aes(month, `Relative Risk (RR)`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="",y="Relative Risk (RR): Month. Reference variable: 2020-12")
# OTHER MODELS ----
## Negative Binomial model - NUMERIC
model.nb <- MASS::glm.nb(deaths~vaccinated+location+month+offset(log(cases)),
data = df_model)
# summary(model.nb)
# exp(model.nb$coefficients[2]*10)
# exp(confint(model.nb,"vaccinated", level=0.95)*10)
## Poisson model
model_poisson <- glm(deaths~vaccinated_level+location+month+offset(log(cases)),
data = df_model,
family = "poisson")
# summary(model_poisson)
# vaccinated variable
# exp(model_poisson$coefficients[2:4])
# exp(confint(model_poisson,2:4, level=0.95))
p_poisson <- f_plot_coef(model_poisson,
title_plot = "Outcome as Poisson Distribution")
## 10 levels categorical model ------
model.10levels <- MASS::glm.nb(deaths~vaccinated_level_10+
location+month+offset(log(cases)),
data = df_model)
# exp(model.10levels$coefficients[2:10])
# exp(confint(model.10levels,2:4, level=0.95))
p_10 <- f_plot_coef(model.10levels,2:10,
var_name = "vaccinated_level_10",
title_plot = "10 Categories for Vaccination Level")
# p_10
## without outliers (below 10% CFR) ------
model_noOutliers <- MASS::glm.nb(deaths~vaccinated_level+
location+month+offset(log(cases)),
data = filter(df_model,cfr<0.1))
# nobs(model_noOutliers)
# exp(model_noOutliers$coefficients[2:4])
# exp(confint(model_noOutliers,2:4, level=0.95))
p_noOut <- f_plot_coef(model_noOutliers,
title_plot = "Only CFR<10% considered")
# p_noOut
grid.arrange(p_10,p_noOut,nrow=1)
## country as random effects ------
library(lme4)
# takes forever to compute - option: load it
# model_random <- lme4::glmer.nb(deaths~(1|location)+vaccinated_level+
# month+offset(log(cases)),
# data=df_model)
# saveRDS(model_random,"randomModel.rds")
model_random <- read_rds("../randomModel.rds")
# nobs(model_random)
# ranef(model_random)
point <- exp(fixef(model_random)[2:4])
ci <- exp(confint(model_random,3:5,method="Wald", level=0.95))
estimates <- data.frame(mid=point,
lower_bound=ci[,1],
upper_bound=ci[,2])
names(estimates) <- c("Estimated Coefficient",
"Lower Bound C.I. 95%",
"Upper Bound C.I. 95%")
estimates <- estimates %>%
rownames_to_column() %>%
rename(Variable=rowname) %>%
mutate(Variable=str_replace(Variable,"vaccinated_level",
paste0("Vaccinated Level",": ")))
# Plot
p_random <- estimates %>%
mutate(Variable=str_remove(Variable,paste0("Vaccinated Level",": "))) %>%
ggplot(aes(Variable, `Estimated Coefficient`))+
geom_point(col="red")+
geom_errorbar(aes(ymin=`Lower Bound C.I. 95%`,
ymax=`Upper Bound C.I. 95%`))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
"Vaccinated Level"),
title = "Mixed Linear Effects Model (Country as Random)")
# Bootstrap
# Source: https://stackoverflow.com/questions/54749641/bootstrapping-with-glm-model
# it took really really long!
# UNCOMMENT TO GENERATE THE OBJECT AGAIN if not, load it
# data structure for results
# nboot <- 1000
# bres <- matrix(NA,
# nrow=nboot,
# ncol=4,
# dimnames=list(rep=seq(nboot),
# coef=names(coef(model.nb2))[1:4]))
# # bootstrap
# set.seed(101)
# bootsize <- nrow(df_model)
# df_boot <- df_model
# for (i in seq(nboot)) {
# bdat <- df_boot[sample(nrow(df_boot),size=bootsize,replace=TRUE),]
# bfit <- update(model.nb2, data=bdat) ## refit with new data
# bres[i,] <- coef(bfit)[1:4]
# }
# # output
# saveRDS(bres,"boot.rds")
bres <- readRDS ("../boot.rds")
boot_estimates <-
data.frame(mean_est=colMeans(exp(bres)),
t(apply(exp(bres),2,quantile,c(0.025,0.975))))
# plot
p_boot <- boot_estimates[2:4,] %>%
rownames_to_column() %>%
mutate(Variable=str_remove(rowname,"vaccinated_level")) %>%
ggplot(aes(Variable, mean_est))+
geom_point(col="red")+
geom_errorbar(aes(ymin=X2.5.,
ymax=X97.5.))+
coord_flip()+
geom_hline(yintercept = 1, linetype="dashed")+
labs(x="Vaccinated Level", y=paste0("Relative Risk (RR): ",
"Vaccinated Level"),
title = "Bootstrap estimates of Original Model")
grid.arrange(p_random,p_boot,nrow=1)
df %>%
mutate(log_cfr=log(cfr)) %>%
pivot_longer(c(cfr, log_cfr), names_to = "key", values_to = "value") %>%
mutate(key=str_replace(key,"log_cfr","Log(CFR)"),
key=str_replace(key,"cfr","CFR")) %>%
ggplot(aes(value,fill=region_name))+
geom_density(alpha=.5)+
facet_wrap(region_name~key,scales = "free")+
labs(y="Density",x="Case Fatality Rate COVID-19 [%]",
fill="Region Name",
caption = "Cumulated cases and deaths for the period December 2020 to July 2022")
# density of vaccination
df %>%
filter(month=="2022-7") %>% #July 2022
ggplot(aes(vaccinated,fill=region_name))+
geom_density(alpha=.5)+
facet_wrap(~region_name,scales = "free")+
labs(y="Density",x="People Fully Vaccinated per 100 people",
fill="Region Name",
caption = "Vaccination status at July 2022")
# Negative Binomial Model - Vaccination as categorical variable
# par(mfrow = c(2, 2))
# plot(model.nb2)
# boot::glm.diag.plots(model.nb2)
# Dispersion test
# tests if the simulated dispersion is equal to the observed dispersion
a <- testDispersion(model.nb2) #looks good
plotQQunif(model.nb2)
# testZeroinflation() - tests if there are more zeros in the data than expected from the simulations
# testZeroInflation(model.nb2)
# testTemporalAutocorrelation() - tests for temporal autocorrelation in the residuals
# simulationOutput <- simulateResiduals(fittedModel = model.nb2)
# testTemporalAutocorrelation(simulationOutput,
# df_model$location_month)
# testOutliers() - tests if there are more simulation outliers than expectedtestOutliers() - tests if there are more simulation outliers than expected
# a <- testOutliers(model.nb2)
# testOutliers(model_noOutliers)
# influence plot
# outliers to remove
# car::influencePlot(model.nb2)
#
#
# df_aux <- df_model[c(-615,-631,-1285,-1332),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-40,-456,-766,-822,-1508),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-327,-835,-859,-931,-1546),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
#
# df_aux <- df_aux[c(-177,-627,-1040,-1274,-1355,-1439),]
# model.nb3 <- MASS::glm.nb(deaths~vaccinated_level+
# location+month+offset(log(cases)),
# data = df_aux)
# car::influencePlot(model.nb3)
sessionInfo()